Welcome! This template will guide you through a Bayesian analysis in R, even if you have never done Bayesian analysis before. This template assumes you have basic familiarity with R. Once complete, this template will produce a summary of the analysis, complete with parameter estimates and credible intervals, and two animated HOPs (see Hullman, Resnick, Adar 2015 DOI: 10.1371/journal.pone.0142444 and Kale, Nguyen, Kay, and Hullman VIS 2018 for more information) for both your prior and posterior estimates.
TODO Paul: we discussed putting in a v small section about why we want to use Bayesian analysis - is this already too long? This Bayesian analysis focuses on producing results in a form that are easily interpretable, even to nonexperts. The credible intervals produced by Bayesian analysis are the analogue of confidence intervals in traditional null hypothesis significance testing (NHST). A weakness of NHST confidence intervals is that they are easily misinterpreted [sources for all of this]. Many people naturally interpret an NHST 95% confidence interval to mean that there is a 95% chance that the true parameter value lies somewhere in that interval; in fact, it means that if the experiment were repeated 100 times, 95 of the resulting confidence intervals would include the true parameter value. THe Bayesian credible interval sidesteps this complication by providing the intuitive meaning: a 95% chance that the true parameter value lies somewhere in that interval. To further support intuitive interpretations of your results, this template also produces animated HOPs plots, a type of plot that is more effective than visualizations such as error bars in helping people make accurate judgments about probability distributions.
This set of templates supports a few types of statistical analysis. (In future work, this list of supported statistical analyses will be expanded.) For clarity, each type has been broken out into a separate template, so be sure to select the right template before you start! A productive way to choose which template to use is to think about what type of chart you would like to produce to summarize your data. Currently, the templates support the following:
Simple bar chart (e.g. t-tests, one-way ANOVA)
Bar chart with groups (e.g. two-way ANOVA)
Line chart with groups (e.g. two-way ANOVA)
Linear regression
TODO Chanda: finish this list once the template list is finalized
This template will produce a line graph of a linear regression. If your analysis includes a regression with a continuous independent variable, this template may be for you.
Once you have selected your template, to complete the analysis, please follow along this template. For each code chunk, you may need to make changes to customize the code for your own analysis. In those places, the code chunk will be preceded by a list of things you need to change (with the heading “What to change”), and each line that needs to be customized will also include the comment #CHANGE ME within the code chunk itself.
Good luck!
If this is your first time using the template, you may need to install libraries. Uncomment the lines below - install.packages() and devtools::install_github() - to install the required packages. This only needs to be done once.
knitr::opts_chunk$set(cache = TRUE)
# install.packages("rstanarm", "tidyverse", "tidybayes", "modelr", "devtools")
# devtools::install_github("thomasp85/gganimate")
library(rstanarm)
library(tidyverse)
library(tidybayes)
library(modelr)
library(gganimate)
Read in the dataset, choose your independent and dependent variables. These are the variables that will correspond to the x and y axis on the final plots. Note: This template requires your data to be normally distributed. The templates currently do not support non-normal data.
What to change
mydata: Read in your data.
mydata$x, mydata$y: Select which variables will appear on the x- and y-axis of your plots.
mydata <- read.csv('datasets/choc_cleaned_data.csv') #CHANGE ME
mydata$x <- mydata$num_products_displayed #CHANGE ME
mydata$y <- mydata$satis_Q1 #CHANGE ME
You can set the aesthetics of your graphs here.
What to change
x_lab & y_lab: Label your x- and y-axes.
[Optional] coord_cartesian(ylim = …): Depending on your data, you may want to manually set the y-axis limits. If so, uncomment this line in the code below and set your preferred limits accordingly.
In most cases, the other default values here should be just fine. If you want to adjust the aesthetics of the animated plots later, you can do so here; just be sure to keep the lines that are commented with “do not change.”
theme_set(theme_light()) # set the ggplot theme for all plots
# label the axes on the plots
x_lab = "# Choices" #CHANGE ME
y_lab = "Satisfaction" #CHANGE ME
# the default code for the plots - if needed, the animated plot aesthetics can be customized here
graph_plot <- function(data) {
ggplot(data, aes(x = x, y = .value)) + #do not change
geom_line() + #do not change
transition_states(.draw, transition_length = 1, state_length = 1) + # gganimate code to animate the plots. Do not change
coord_cartesian(ylim = c(min(mydata$y, na.rm=T), max(mydata$y, na.rm=T))) + # sets axis limits - CHANGE ME (optional)
labs(x=x_lab, y=y_lab) # axes labels
}
# Animation parameters
n_draws = 100 # the number of draws to visualize in the HOPs plots
frames_per_second = 2.5 # the speed of the HOPs
# 2.5 frames per second (400ms) is the recommended speed for the HOPs visualization.
# Faster speeds (100ms) have been demonstrated to not work as well.
# See Kale et al. VIS 2018 for more info.
We’ll fit the following model: stan_glm(y ~ x), which specifies a linear regression where each \(y_i\) is drawn from a normal distribution with mean equal to \(a + bx_i\) and standard deviation equal to sigma (\(\sigma\)):
\[ y_i \sim Normal(a + bx_i, \sigma) \] ###Set priors In this section, you will set priors for your model. Setting priors thoughtfully is critical to any Bayesian analysis. They tell your model your best prior belief of what reasonable estimates for prior values might be. Ideally, you will have previous literature from which to draw these prior estimates. If no previous studies exist, you can instead assign “weakly informative priors” that only minimially restrict the model; for example, a weakly informative prior for a parameter that can only have values between 1 and 7 would assign a very small probability to values outside of that range. We have provided an example of how to set priors below.
To check the plausibility of your priors, use the code section after this one to generate a graph of 100 plausible fit lines from your priors to check if the values generated are reasonable.
What to change
(Parameter means): For each parameter, assign a mean prior value. This template currently only supports setting the intercept parameter (overall mean/DF) and one additional prior mean and SD that are used across all levels of the factor.
(Parameter standard deviations): As above, for each parameter, assign a value for the standard deviation of its prior.
# taken from Iyengar & Lepper 2000
#0 choices: (M = 4.92, SD = 0.98, n=67); 6 choices: (M = 6.28, SD = 0.54, n=33); 30 choices: (M = 5.46, SD = 0.82, n=34)
prior_mean <- 6.28*(33/67) + 5.46*(34/67) #5.863881
# in absence of prior literature, we set priors so that
# a change from the lowest to highest choice set size
# results in a -2 change in satisfaction and the mean
# satisfaction across all choice set sizes is still 5.86
mean_choice_set_size <- (12 + 24 + 40 + 50 + 60 + 72) / 6
range_choice_set_size <- 72-12
b1_prior <- -2/range_choice_set_size #-0.03333333
b1_sd <- 0.01
a_prior <- prior_mean - b1_prior*mean_choice_set_size #7.297214
a_sd <- 2 #flat sd for a > 5.86
Next, you’ll want to check your priors by running this code chunk. Drawing from your prior distribution, it will display a selection of 100 plausible fit lines, so you can check to see if the values generated are reasonable. (We’ll go into the details of this code later.)
What is “reasonable” depends on what you know about the expected values before looking at the data from the study. If you do not have previous studies to help set these priors, it is usually best to set weakly informative priors with large standard deviations that only make the very extreme and impossible values unlikely. If you have previous literature to use as a guide in setting priors, you can use the mean estimate SD from that paper to assign a narrower prior distribution.
What to change
Nothing! Just run this code to check your priors, adjusting prior values above as needed until you find reasonable prior values.
#generate sample draws from the priors
m_prior = stan_glm(y ~ x, data = mydata,
prior_intercept = normal(a_prior, a_sd, autoscale = FALSE),
prior = normal(b1_prior, b1_sd, autoscale = FALSE),
prior_PD = TRUE
)
#create the dataframe for fitted draws & create the plot
mydata %>%
data_grid(x = seq_range(x, n = 101)) %>%
add_fitted_draws(m_prior, n = 100, seed = 12345) %>%
ggplot(aes(x = x, y = .value)) +
geom_line(aes(group = .draw), alpha = .2) +
labs(x=x_lab, y=y_lab) # axes labels
stan_glm() will place a default prior on the standard deviation (\(\sigma\)); we can keep this. The main priors of interest are those on \(a\) and \(b\), which correspond to the intercept and slope (respectively) of the relationship between \(x\) and \(y\). We can set those priors using the prior_intercept and prior arguments to stan_glm:
m = stan_glm(y ~ x, data = mydata,
prior_intercept = normal(a_prior, a_sd, autoscale = FALSE),
prior = normal(b1_prior, b1_sd, autoscale = FALSE)
)
Here is a summary of the model fit:
summary(m, digits=3)
##
## Model Info:
##
## function: stan_glm
## family: gaussian [identity]
## formula: y ~ x
## algorithm: sampling
## priors: see help('prior_summary')
## sample: 4000 (posterior sample size)
## observations: 611
## predictors: 2
##
## Estimates:
## mean sd 2.5% 25% 50% 75%
## (Intercept) 6.225 0.096 6.037 6.159 6.226 6.291
## x -0.002 0.002 -0.006 -0.003 -0.002 -0.001
## sigma 1.027 0.030 0.972 1.007 1.027 1.047
## mean_PPD 6.136 0.060 6.019 6.096 6.136 6.175
## log-posterior -891.867 1.247 -894.992 -892.433 -891.547 -890.960
## 97.5%
## (Intercept) 6.416
## x 0.002
## sigma 1.087
## mean_PPD 6.254
## log-posterior -890.441
##
## Diagnostics:
## mcse Rhat n_eff
## (Intercept) 0.002 1.000 3401
## x 0.000 1.000 3361
## sigma 0.000 1.000 4000
## mean_PPD 0.001 1.000 4000
## log-posterior 0.028 1.001 2022
##
## For each parameter, mcse is Monte Carlo standard error, n_eff is a crude measure of effective sample size, and Rhat is the potential scale reduction factor on split chains (at convergence Rhat=1).
Given this model, we might want to plot the fit line with credible bands around it. To do that, we will first construct a fit grid: a data frame of points at which we want to calculate the value of the fit line from the model. The data_grid function allows us to do this easily, e.g. by asking for 20 equally spaced points along the value of the x variable in our original data:
mydata %>%
data_grid(x = seq_range(x, n = 20))
## # A tibble: 20 x 1
## x
## <dbl>
## 1 12
## 2 15.2
## 3 18.3
## 4 21.5
## 5 24.6
## 6 27.8
## 7 30.9
## 8 34.1
## 9 37.3
## 10 40.4
## 11 43.6
## 12 46.7
## 13 49.9
## 14 53.1
## 15 56.2
## 16 59.4
## 17 62.5
## 18 65.7
## 19 68.8
## 20 72
Given this grid, we can then draw samples from the posterior mean evaluated at each x position in the grid using the add_fitted_draws function, and then summarize these samples in ggplot using a stat_lineribbon:
mydata %>%
data_grid(x = seq_range(x, n = 20)) %>%
add_fitted_draws(m) %>%
ggplot(aes(x = x, y = .value)) +
stat_lineribbon() +
# coord_cartesian(ylim = c(min(mydata$y, na.rm=T), max(mydata$y, na.rm=T))) + # sets axis limits - CHANGE ME (optional)
scale_fill_brewer()
But what we really want is to display a selection of plausible fit lines, say 100 of them. To do that, we instead ask add_fitted_draws for only 100 draws, which we plot separately as lines:
mydata %>%
data_grid(x = seq_range(x, n = 101)) %>%
# the seed argument is for reproducibility: it ensures the pseudo-random
# number generator used to pick draws has the same seed on every run,
# so that someone else can re-run this code and verify their output matches
add_fitted_draws(m, n = 100, seed = 12345) %>%
ggplot(aes(x = x, y = .value)) +
# coord_cartesian(ylim = c(min(mydata$y, na.rm=T), max(mydata$y, na.rm=T))) + # sets axis limits - CHANGE ME (optional)
geom_line(aes(group = .draw), alpha = .2)
Or even better, to animate these:
p = mydata %>%
data_grid(x = seq_range(x, n = 101)) %>%
add_fitted_draws(m, n = n_draws, seed = 12345)
# the seed argument is for reproducibility: it ensures the pseudo-random
# number generator used to pick draws has the same seed on every run,
# so that someone else can re-run this code and verify their output matches
animate(graph_plot(p), nframes = n_draws * 2, fps = frames_per_second)
For more context, we could also show the fit lines with the data:
animate(graph_plot(p) + geom_count(aes(y = y), data = mydata), nframes = n_draws * 2, fps = frames_per_second)
We already looked at some sample plots of the priors when we were setting priors; now we want to look at these priors again, but in a HOPs format so we can compare to the posterior plots. To get the prior plots, we can simply ask stan_glm to sample from the prior. If you are knitting this document, or if you already ran the code in the “Check priors” section that calculates m_prior, you can comment out this line:
#m_prior = update(m, prior_PD = TRUE)
Then our code to generate plots is identical, except we replace m with m_prior:
p_prior = mydata %>%
data_grid(x = seq_range(x, n = 101)) %>%
add_fitted_draws(m_prior, n = n_draws, seed = 12345)
animate(graph_plot(p_prior), nframes = n_draws * 2, fps = frames_per_second)
Again, with context:
animate(graph_plot(p_prior) + geom_count(aes(y = y), data = mydata), nframes = n_draws * 2, fps = frames_per_second)